Clustering Research across Tibetan and Chinese Texts

نویسندگان

  • G. X. Xu
  • W. Sun
  • X. P. Peng
چکیده

Tibetan text clustering has potential in Tibetan information processing domain. In this paper, clustering research across Chinese and Tibetan texts is proposed to benefit Chinese and Tibetan machine translation and sentence alignment. A Tibetan and Chinese keyword table is the main way to implement the text clustering across these two languages. Improved Kmeans and improved density-based spatial clustering of applications with noise (DBSCAN) algorithm are proposed. Experiments show that improved K-means algorithm gains stable text clustering result and performs better than traditional K-means after eliminating the limitation of random selection of initial k data. The improved DBSCAN algorithm obtains good performance through reasonable parameter setting. Improved DBSCAN performs better than improved K-means. The study is helpful and meaningful for the parallel corpus construction of Chinese and Tibetan texts. Subject Categories and Descriptors I.5.3 [Clustering]: Algorithm; I.2.7 [Natural Language Processing]: Text Analysis General Terms: Algorithm, Performance

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Tibetan Text Clustering Based on Machine Learning

Tibetan information processing technology has been obtained some achievements. But it falls behind Chinese and English information processing. It still needs to be paid more attention. Text clustering has the potential to accelerate the development of Tibetan information processing. In this paper, we propose an approach of Tibetan text clustering based on machine learning. Firstly, the approach...

متن کامل

Tibetan-Chinese Cross Language Text Similarity Calculation Based on LDA Topic Model

Topic model building is the basis and the most critical module of cross-language topic detection and tracking. Topic model also can be applied to cross-language text similarity calculation. It can improve the efficiency and the speed of calculation by reducing the texts’ dimensionality. In this paper, we use the LDA model in cross-language text similarity computation to obtain Tibetan-Chinese c...

متن کامل

Chinese-Tibetan bilingual clustering based on random walk

In recent years, multi-source clustering has received a significant amount of attention. Several multi-source clustering methods have been developed from different perspectives. In this paper, aiming at addressing the problem of Chinese–Tibetan bilingual document clustering, a novel bilingual clustering scheme is proposed, which can well capture both the intralingua document structures and inte...

متن کامل

Study on Hot Topic Discovery from Chinese Texts

With the development of information technology, there has been an increased popularity in the use of electronic texts. Topic detection and tracking can identify hot information from isolated texts. Obtaining hot topics has become an important issue in recent years. The combination of statistics and natural language processing was utilized in the current study to discover hot topics from texts. ...

متن کامل

Chinese Name Disambiguation Based on Adaptive Clustering with the Attribute Features

To aim at the evaluation task of CLP2012 named entity recognition and disambiguation in Chinese, a Chinese name disambiguation method based on adaptive clustering with the attribute features is proposed. Firstly, 12-dimensional character attribute features is defined, and tagged attribute feature corpus are used to train to obtain the recognition model of attribute features by Conditional Rando...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015